19 research outputs found

Optimizing text mining methods for improving biomedical natural language processing

Author: Mehryary Farrokh
Publication venue: University of Turku
Publication date: 04/02/2022

The overwhelming amount and the increasing rate of publication in the biomedical domain make it difficult for life sciences researchers to acquire and maintain all information that is necessary for their research. PubMed (the primary citation database for the biomedical literature) currently contains over 21 million article abstracts, and more than one million of them were published in 2020 alone. Even though existing article databases provide capable keyword search services, typical everyday-life queries usually return thousands of relevant articles. For instance, a cancer research scientist may need to acquire a complete list of genes that interact with the BRCA1 (breast cancer 1) gene. The PubMed keyword search for BRCA1 returns over 16,500 article abstracts, making manual inspection of the retrieved documents impractical. Missing even one of the interacting gene partners in this scenario may jeopardize successful development of a potential new drug or vaccine. Although manually curated databases of biomolecular interactions exist, they are usually not up-to-date and they require notable human effort to maintain. To summarize, new discoveries are constantly being shared within the community via scientific publishing, but unfortunately the probability of missing vital information for research in life sciences is increasing. In response to this problem, the biomedical natural language processing (BioNLP) community of researchers has emerged and strives to assist life sciences researchers by building modern language processing and text mining tools that can be applied at large scale to scan the whole publicly available literature and extract, classify, and aggregate the information found within. Such tools keep life sciences researchers up to date with recent relevant discoveries and facilitate their research in numerous fields such as molecular biology, biomedical engineering, bioinformatics, genetic engineering and biochemistry.
My research has almost exclusively focused on biomedical relation and event extraction tasks. These foundational information extraction tasks deal with automatic detection of biological processes, interactions and relations described in the biomedical literature. Precisely speaking, biomedical relation and event extraction systems can scan through a vast amount of biomedical texts and automatically detect and extract the semantic relations of biomedical named entities (e.g., genes, proteins, chemical compounds, and diseases). The structured outputs of such systems (i.e., the extracted relations or events) can be stored as relational databases or molecular interaction networks which can easily be queried, filtered, analyzed, visualized and integrated with other structured data sources. Extracting biomolecular interactions has always been the primary interest of BioNLP researchers because knowledge about such interactions is crucially important in various research areas including precision medicine, drug discovery, drug repurposing, hypothesis generation, construction and curation of signaling pathways, and protein function and structure prediction. State-of-the-art relation and event extraction methods are based on supervised machine learning, requiring manually annotated data for training. Manual annotation for the biomedical domain requires domain expertise and is time-consuming. Hence, having minimal training data for building information extraction systems is a common case in the biomedical domain. This demands development of methods that can make the most out of available training data, and this thesis gathers all my research efforts and contributions in that direction. It is worth mentioning that biomedical natural language processing has undergone a revolution since I started my research in this field almost ten years ago.
As a member of the BioNLP community, I have witnessed the emergence, improvement, and in some cases the disappearance of many methods, each pushing the performance of the best previous method one step further. I can broadly divide the last ten years into three periods. When I started my research, feature-based methods that relied on heavy feature engineering were dominant and popular. Then, significant advancements in hardware technology, as well as several breakthroughs in algorithms and methods, enabled machine learning practitioners to seriously utilize artificial neural networks for real-world applications. In this period, convolutional, recurrent, and attention-based neural network models became dominant and superior. Finally, the introduction of transformer-based language representation models such as BERT and GPT impacted the field and resulted in unprecedented performance improvements on many data sets. When reading this thesis, I ask the reader to take into account the course of history and judge the methods and results based on what could have been done in that particular period of history.

UTUPub

Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction

Author: Farrokh Mehryary
Filip Ginter
Jari Björne
Tapio Salakoski
Publication venue: 'Oxford University Press (OUP)'
Publication date: 28/10/2022

Biomedical researchers regularly discover new interactions between chemical compounds/drugs and genes/proteins, and report them in research literature. Having knowledge about these interactions is crucially important in many research areas such as precision medicine and drug discovery. The BioCreative VI Task 5 (CHEMPROT) challenge promotes the development and evaluation of computer systems that can automatically recognize and extract statements of such interactions from biomedical literature. We participated in this challenge with a Support Vector Machine (SVM) system and a deep learning-based system (ST-ANN), and achieved an F-score of 60.99 for the task. After the shared task, we have significantly improved the performance of the ST-ANN system. Additionally, we have developed a new deep learning-based system (I-ANN) that considerably outperforms the ST-ANN system. Both ST-ANN and I-ANN systems are centered around training an ensemble of artificial neural networks and utilizing different bidirectional Long Short-Term Memory (LSTM) chains for representing the shortest dependency path and/or the full sentence. By combining the predictions of the SVM and the I-ANN systems, we achieved an F-score of 63.10 for the task, improving our previous F-score by 2.11 percentage points. Our systems are fully open-source and publicly available. We highlight that the systems we present in this study are not applicable only to the BioCreative VI Task 5, but can be effortlessly re-trained to extract any type of relation of interest, with no modifications of the source code required, if a manually annotated corpus is provided as training data in a specific file format.
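The abstract above mentions representing the shortest dependency path between two candidate entities. As a rough illustration only, the following sketch finds such a path with a breadth-first search over an undirected view of a dependency parse. The function name, the example sentence and its hand-written parse are all assumptions made for this illustration; none of them come from the paper or its released code.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the shortest token path between two entities, or None.

    `edges` is a list of (head, dependent) pairs. The parse is treated
    as an undirected graph, a common simplification for this purpose.
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        # Sorting only makes the traversal order deterministic.
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical, hand-written parse of the invented sentence
# "Aspirin strongly inhibits the enzyme COX1".
toy_edges = [
    ("inhibits", "Aspirin"),
    ("inhibits", "strongly"),
    ("inhibits", "COX1"),
    ("COX1", "the"),
    ("COX1", "enzyme"),
]

print(shortest_dependency_path(toy_edges, "Aspirin", "COX1"))
# → ['Aspirin', 'inhibits', 'COX1']
```

In a real system the edges would come from a syntactic parser, and the tokens on the path (possibly with their part-of-speech tags and dependency types) would be fed to the sequence model.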

UTUPub

Proceedings of the 4th BioNLP Shared Task Workshop

Author: Farrokh Mehryary
Filip Ginter
Jari Björne
Sampo Pyysalo
Tapio Salakoski
Publication venue: Stroudsburg
Publication date: 28/10/2022

We present the TurkuNLP entry to the BioNLP Shared Task 2016 Bacteria Biotopes event extraction (BB3-event) subtask. We propose a deep learning-based approach to event extraction using a combination of several Long Short-Term Memory (LSTM) networks over syntactic dependency graphs. Features for the proposed neural network are generated based on the shortest path connecting the two candidate entities in the dependency graph. We further detail how this network can be efficiently trained to have good generalization performance even when only a very limited number of training examples are available and part-of-speech (POS) and dependency type feature representations must be learned from scratch. Our method ranked second among the entries to the shared task, achieving an F-score of 52.1% with 62.3% precision and 44.8% recall.
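The reported F-score can be checked from the reported precision and recall, assuming it is the balanced F1 measure (the harmonic mean of the two). The helper below is a minimal sketch written for this check; its name and signature are illustrative and not taken from the shared task's evaluation tools.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values reported in the abstract, in percent.
reported_precision = 62.3
reported_recall = 44.8

f1 = f_score(reported_precision, reported_recall)
print(f"{f1:.1f}")  # → 52.1
```

The result matches the 52.1% stated in the abstract, which is consistent with the F1 assumption.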

UTUPub

Neural Network and Random Forest Models in Protein Function Prediction

Author: Björne Jari
Ginter Filip
Hakala Kai
Kaewphan Suwisa
Mehryary Farrokh
Moen Hans
Salakoski Tapio
Tolvanen Martti
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/10/2022

Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system automatically assigning Gene Ontology (GO) terms to the given input protein sequence. We develop an ensemble system which combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with an NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models together with the data pre-processing and feature generation tools are publicly available as open-source software at https://github.com/TurkuNLP/CAFA3.

UTUPub

The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest

Author: Bork Peer
Doncheva Nadezhda T
Fang Tao
Gable Annika L
Hachilif Radja
Jensen Lars J
Kirsch Rebecca
Koutrouli Mikaela
Mehryary Farrokh
Nastou Katerina
Pyysalo Sampo
Szklarczyk Damian
von Mering Christian
Publication venue: 'Oxford University Press (OUP)'
Publication date: 06/01/2023

Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein-protein interactions, both physical interactions and functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data, and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.

ZORA

Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task

Author: Abeed Sarker
Anthony Rios
Berry de Bruijn
Debanjan Mahata
Farrokh Mehryary
Filip Ginter
Goran Nenadic
Graciela Gonzalez-Hernandez
Jasper Friedrichs
Kai Hakala
Maksim Belousov
Ramakanth Kavuluru
Saif M. Mohammad
Sifei Han
Svetlana Kiritchenko
Tung Tran
Publication venue: 'Oxford University Press (OUP)'
Publication date: 28/10/2022

Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and Methods: We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).

UTUPub

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Author: Almeida-e-Silva Danillo C.
Altenhoff Adrian
Babbitt Patricia C.
Bankapur Asma R.
Bargsten Joachim W.
Ben-Hur Asa
Benso Alfredo
Bhat Prajwal
Bkc Dukka
Bonneau Richard
Brenner Steven E.
Bryson Kevin
Cao Renzhi
Casadio Rita
Cejuela Juan M.
Chapman Samuel
Chen Ching-Tai
Cheng Jianlin
Cibrian-Uhalte Elena
Clark Wyatt T.
Cozzetto Domenico
D'Andrea Daniel
Das Sayoni
Dawson Natalie L.
del Pozo Angela
Denny Paul
Dessimoz Christophe
Di Carlo Stefano
Dogan Tunca
ElShal Sarah
Falda Marco
Fang Hai
Feng Shou
Fernández José M.
Ferrari Carlo
Fontana Paolo
Foulger Rebecca E.
Friedberg Iddo
Funk Christopher S.
Gabaldon Toni
Gemovic Branislava
Gillis Jesse
Ginter Filip
Giollo Manuel
Glisic Sanja
Goldberg Tatyana
Gong Qingtian
Gough Julian
Greene Casey S.
Hakala Kai
Hamp Tobias
Hieta Reija
Holm Liisa
Hsu Wen-Lian
Huntley Rachael P.
Jiang Yuxiang
Jones David T.
Kaewphan Suwisa
Kahanda Indika
Kansakar Lakesh
Khan Ishita K.
Kihara Daisuke
Koo Da Chen Emily
Koskinen Patrik
Lavezzo Enrico
Lee David
Lees Jonathan G.
Legge Duncan
Lepore Rosalba
Li Biao
Lin Alexandra
Linial Michal
Lovering Ruth C.
Magrane Michele
Maietta Paolo
Marcet-Houben Marina
Martelli Pier Luigi
Martin Maria J.
Mehryary Farrokh
Melidoni Anna N.
Mesiti Marco
Minneci Federico
Mooney Sean D.
Moreau Yves
Mutowo-Meullenet Prudence
Nepusz Tamás
Ning Wei
O'Donovan Claire
Oates Matt
Ofer Dan
Orengo Christine A.
Oron Tal Ronnen
Paccanaro Alberto
Pavlidis Paul
Penfold-Brown Duncan
Perovic Vladimir
Pichler Klemens
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Rappoport Nadav
Re Matteo
Rehman Hafeez Ur
Richter Lothar
Robinson Peter N.
Romero Alfonso E.
Rost Burkhard
Sahraeian Sayed M.E.
Salakoski Tapio
Salamov Asaf
Sasidharan Rajkumar
Savino Alessandro
Sedeño-Cortés Adriana E.
Sharan Malvika
Shasha Dennis
Shypitsyna Aleksandra
Sillitoe Ian
Skunca Nives
Smithers Ben
Stern Amos
Sternberg Michael J.E.
Supek Fran
Tian Weidong
Toppo Stefano
Tosatto Silvio C.E.
Tramontano Anna
Tranchevent Léon-Charles
Tress Michael L.
Törönen Petri
Valencia Alfonso
Valentini Giorgio
van Dijk Aalt D.J.
Veljkovic Nevena
Veljkovic Veljko
Vencio Ricardo ZN
Verspoor Karin M.
Vogel Jörg
Vucetic Slobodan
Wang Zheng
Wass Mark N.
Yang Haixuan
Youngs Noah
Zakeri Pooya
Zhang Shanshan
Zhong Zhaolong
Zhou Yuanpeng
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritization

Brage HiM

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Author: Alborzi Seyed Ziaeddin
Altenhoff Adrian
Amezola Miguel
Antczak Magdalena
Aridhi Sabeur
Asgari Ehsaneddin
Atalay Volkan
Babbitt Patricia C.
Barot Meet
Ben-Hur Asa
Benso Alfredo
Bergquist Timothy R.
Berselli Michele
Bhat Prajwal
Björne Jari
Black Gage S.
Boecker Florian
Bonneau Richard
Borukhov Itamar
Bosco Giovanni
Boudellioua Imane
Brackenridge Danielle A.
Brenner Steven E.
Cao Renzhi
Carraro Marco
Casadio Rita
Cetin-Atalay Rengul
Chandler Caleb
Chang Jia-Ming
Cheng Jianlin
Chi Po-Han
Cozzetto Domenico
Crocker Alex W.
Dai Suyang
Dalkiran Alperen
Das Sayoni
Davidović Radoslav S.
Davis Larry
Dayton Jonathan B.
Dessimoz Christophe
Devignes Marie-Dominique
Di Carlo Stefano
Dogan Tunca
Dzeroski Saso
Emily Koo Da Chen
Fa Rui
Fabris Fabio
Falda Marco
Fang Hai
Fernández José M.
Fontana Paolo
Frank Yotam
Frasca Marco
Freddolino Peter L.
Freitas Alex A.
Friedberg Iddo
Gemovic Branislava
Georghiou George
Ginter Filip
Gligorijević Vladimir
Goldberg Tatyana
Gough Julian
Greene Casey S.
Grossi Giuliano
Hakala Kai
Hamid Md Nafiz
Hoehndorf Robert
Hogan Deborah A.
Holm Liisa
Hou Jie
Hou Jie
Hurto Rebecca L.
Jain Aashish
Jeffery Constance J.
Jiang Yuxiang
Jo Dane
Johnson Devon
Jones David T.
Kacsoh Balint Z.
Kaewphan Suwisa
Kahanda Indika
Kihara Daisuke
Kulmanov Maxat
Larsen Dallas J.
Lavezzo Enrico
Lee Alexandra J.
Lees Jonathan Gill
Lewis Kimberley A.
Liao Wen-Hung
Lichtarge Olivier
Linial Michal
Liu Yi-Wei
Mao Qizhong
Martelli Pier Luigi
Martin Maria J.
McGuffin Liam
McHardy Alice C.
Medlar Alan J.
Mehryary Farrokh
Mesiti Marco
Moen Hans
Mofrad Mohammad R. K.
Mooney Sean D.
Nguyen Huy N.
Notaro Marco
Novikov Ilya
Omdahl Ashton R.
Orengo Christine A.
O’Donovan Claire
Paccanaro Alberto
Pascarelli Stefano
Perovic Vladimir R.
Petrini Alessandro
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Re Matteo
Reeb Jonas
Rehman Hafeez Ur
Renaux Alexandre
Rifaioglu Ahmet S.
Ritchie David W.
Roche Daniel B.
Rodriguez Jose Manuel
Romero Alfonso E.
Rose Peter W.
Rost Burkhard
Sagers Luke W.
Saidi Rabie
Salakoski Tapio
Savojardo Castrense
Sillitoe Ian
Suh Erica
Sumonja Neven
Supek Fran
Thurlby Natalie
Tian Weidong
Tolvanen Martti E. E.
Toppo Stefano
Torres Mateo
Tosatto Silvio C. E.
Tress Michael L.
Tseng Wei-Cheng
Törönen Petri
Valentini Giorgio
Veljkovic Nevena
Vesztrocy Alex Warwick
Vidulin Vedrana
Vucetic Slobodan
Wan Cen
Wang Zheng
Wass Mark N.
Wilkins Angela
Yang Haixuan
Yao Shuwei
You Ronghui
Yunes Jeffrey M.
Zhang Chengxin
Zhang Feng
Zhang Shanshan
Zhang Yang
Zhang Zihan
Zhao Chenguang
Zhou Naihui
Zhu Shanfeng
Zosa Elaine
Šmuc Tomislav
Publication venue
Publication date: 01/01/2019

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

REPISALUD

Archivio istituzionale della ricerca - Università di Padova

Helmholtz Zentrum für Infektionsforschung Repository

Central Archive at the University of Reading

AIR Universita degli studi di Milano

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Repository of the Vinča Nuclear Institute (VinaR)

OpenMETU (Middle East Technical University)

Explore Bristol Research

Deep Blue Documents at the University of Michigan

Archivio istituzionale della ricerca - Fondazione Edmund Mach

HAL Clermont Université

HAL Descartes

Helsingin yliopiston digitaalinen arkisto

Hal-Diderot

Hacettepe University Institutional Repository

Repository for Publications and Research Data

INRIA a CCSD electronic archive server

UCL Discovery

Kent Academic Repository

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Author: Aashish Jain
Adrian Altenhoff
Ahmet S. Rifaioglu
Alan J. Medlar
Alberto Paccanaro
Alessandro Petrini
Alex A. Freitas
Alex W. Crocker
Alex Warwick Vesztrocy
Alexandra J. Lee
Alexandre Renaux
Alfonso E. Romero
Alfredo Benso
Alice C. McHardy
Alperen Dalkıran
Angela Wilkins
Asa Ben-Hur
Ashton R. Omdahl
Balint Z. Kacsoh
Branislava Gemovic
Burkhard Rost
Caleb Chandler
Casey S. Greene
Castrense Savojardo
Cen Wan
Chenguang Zhao
Chengxin Zhang
Christine A. Orengo
Christophe Dessimoz
Claire O’Donovan
Constance J. Jeffery
Da Chen Emily Koo
Daisuke Kihara
Dallas J. Larsen
Damiano Piovesan
Dane Jo
Daniel B. Roche
Danielle A. Brackenridge
David T. Jones
David W. Ritchie
Deborah A. Hogan
Devon Johnson
Domenico Cozzetto
Ehsaneddin Asgari
Elaine Zosa
Enrico Lavezzo
Erica Suh
Fabio Fabris
Farrokh Mehryary
Feng Zhang
Filip Ginter
Florian Boecker
Fran Supek
Gage S. Black
George Georghiou
Gianfranco Politano
Giorgio Valentini
Giovanni Bosco
Giuliano Grossi
Giuseppe Profiti
Hafeez Ur Rehman
Hai Fang
Haixuan Yang
Hans Moen
Heiko Schoof
Huy N. Nguyen
Ian Sillitoe
Iddo Friedberg
Ilya Novikov
Imane Boudellioua
Indika Kahanda
Itamar Borukhov
Jari Björne
Jeffrey M. Yunes
Jia-Ming Chang
Jianlin Cheng
Jie Hou
Jonas Reeb
Jonathan B. Dayton
Jonathan Gill Lees
Jose Manuel Rodriguez
José M. Fernández
Julian Gough
Kai Hakala
Kimberley A. Lewis
Larry Davis
Liam J. McGuffin
Liisa Holm
Magdalena Antczak
Marco Carraro
Marco Falda
Marco Frasca
Marco Mesiti
Marco Notaro
Maria J. Martin
Marie-Dominique Devignes
Mark N. Wass
Martti E.E. Tolvanen
Mateo Torres
Matteo Re
Maxat Kulmanov
Md Nafiz Hamid
Meet Barot
Michael L. Tress
Michal Linial
Michele Berselli
Miguel Amezola
Mohammad R.K. Mofrad
Naihui Zhou
Natalie Thurlby
Neven Sumonja
Nevena Veljkovic
Olivier Lichtarge
Paolo Fontana
Patricia C. Babbitt
Peter L. Freddolino
Peter W. Rose
Petri Törönen
Pier Luigi Martelli
Po-Han Chi
Prajwal Bhat
Predrag Radivojac
Qizhong Mao
Rabie Saidi
Radoslav S. Davidović
Rebecca L. Hurto
Rengul Cetin Atalay
Renzhi Cao
Richard Bonneau
Rita Casadio
Robert Hoehndorf
Ronghui You
Rui Fa
Sabeur Aridhi
Saso Dzeroski
Sayoni Das
Sean D. Mooney
Seyed Ziaeddin Alborzi
Shanfeng Zhu
Shanshan Zhang
Shuwei Yao
Silvio C.E. Tosatto
Slobodan Vucetic
Stefano Di Carlo
Stefano Pascarelli
Stefano Toppo
Steven E. Brenner
Suwisa Kaewphan
Suyang Dai
Tapio Salakoski
Tatyana Goldberg
Timothy R. Bergquist
Tomislav Šmuc
Tunca Dogan
Vedrana Vidulin
Vladimir Gligorijević
Vladimir R. Perovic
Volkan Atalay
Wei-Cheng Tseng
Weidong Tian
Wen-Hung Liao
Yang Zhang
Yi-Wei Liu
Yotam Frank
Yuxiang Jiang
Zheng Wang
Zihan Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/10/2022

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

UTUPub

Optimization and Applications of Large-Scale Biomedical Event Networks

Author: Mehryary Farrokh
Publication venue: University of Turku
Publication date: 01/06/2016

The overwhelming amount and unprecedented speed of publication in the biomedical domain make it difficult for life science researchers to acquire and maintain a broad view of the field and gather all information that would be relevant for their research. As a response to this problem, the BioNLP (Biomedical Natural Language Processing) community of researchers has emerged and strives to assist life science researchers by developing modern natural language processing (NLP), information extraction (IE) and information retrieval (IR) methods that can be applied at large scale, to scan the whole publicly available biomedical literature and extract and aggregate the information found within, while automatically normalizing the variability of natural language statements. Among different tasks, biomedical event extraction has recently received much attention within the BioNLP community. Biomedical event extraction constitutes the identification of biological processes and interactions described in biomedical literature, and their representation as a set of recursive event structures. The 2009–2013 series of BioNLP Shared Tasks on Event Extraction have given rise to a number of event extraction systems, several of which have been applied at a large scale (the full set of PubMed abstracts and PubMed Central Open Access full text articles), leading to the creation of massive biomedical event databases, each containing millions of events. Since top-ranking event extraction systems are based on machine learning and are trained on the narrow-domain, carefully selected Shared Task training data, their performance drops when faced with the topically highly varied PubMed and PubMed Central documents. Specifically, false-positive predictions by these systems lead to the generation of incorrect biomolecular events, which are spotted by end users.
This thesis proposes a novel post-processing approach, utilizing a combination of supervised and unsupervised learning techniques, that can automatically identify and filter out a considerable proportion of incorrect events from large-scale event databases, thus increasing the general credibility of those databases. The second part of this thesis is dedicated to a system we developed for hypothesis generation from large-scale event databases, which is able to discover novel biomolecular interactions among genes/gene-products. We cast the hypothesis generation problem as supervised network topology prediction, i.e., predicting new edges in the network, as well as types and directions for these edges, utilizing a set of features that can be extracted from large biomedical event networks. Routine machine learning evaluation results, as well as manual evaluation results, suggest that the problem is indeed learnable. This work won the Best Paper Award at the 5th International Symposium on Languages in Biology and Medicine (LBM 2013).
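To make the idea of casting hypothesis generation as edge prediction more concrete, the sketch below computes two generic topological features (common-neighbor count and Jaccard similarity) for every unconnected node pair in a toy undirected network. This is an illustration under stated assumptions: the abstract does not say which features the thesis uses, the feature choice and function name here are hypothetical, the gene names are placeholders, and edge types and directions are ignored.

```python
from itertools import combinations


def candidate_edge_features(adjacency):
    """Compute simple topological features for each unconnected node pair.

    `adjacency` maps a node to the set of its neighbors (undirected).
    The features are generic link-prediction features chosen for
    illustration, not the feature set of the thesis.
    """
    features = {}
    for a, b in combinations(sorted(adjacency), 2):
        if b in adjacency[a]:
            continue  # already connected, so not a candidate new edge
        common = adjacency[a] & adjacency[b]
        union = adjacency[a] | adjacency[b]
        features[(a, b)] = {
            "common_neighbors": len(common),
            "jaccard": len(common) / len(union) if union else 0.0,
        }
    return features


# Hypothetical toy network with placeholder gene names.
toy_network = {
    "GENE_A": {"GENE_B", "GENE_C"},
    "GENE_B": {"GENE_A", "GENE_C", "GENE_D"},
    "GENE_C": {"GENE_A", "GENE_B", "GENE_D"},
    "GENE_D": {"GENE_B", "GENE_C"},
}

for pair, values in candidate_edge_features(toy_network).items():
    print(pair, values)
```

Feature vectors like these, labeled by whether the edge later turns out to exist, could then be given to any supervised classifier. Predicting edge type and direction, as the abstract describes, would need additional labels and features.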

UTUPub